
Module 8: Iteration

In this lesson, we learn how to implement looping techniques to minimize repeated code and improve efficiency.

Author

Jacob Jameson

Iteration

Download a copy of Module 8 slides

Lab 8

General Guidelines:

You will encounter a few functions we did not cover in the lecture video. This will give you some practice using a new function for the first time. You can try the following steps:

  1. Start by typing ?new_function in your Console to open up the help page
  2. Read the help page for new_function. The description might be too technical for now. That’s OK. Pay attention to the Usage and Arguments sections, especially the argument x, or x and y when two arguments are required.
  3. At the bottom of the help page, there are a few examples. Run the first few lines to see how it works
  4. Apply it in your lab questions
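
As an illustration, the steps above might look like this for median(), which appears later in this lab. This is only a sketch: args() and example() are base R helpers that complement the help page, and the input vector is made up.

```r
# Step 1: open the help page (type this in the Console)
# ?median

# Step 2: args() prints the function's arguments,
# a quick complement to the Usage section of the help page
args(median)

# Step 3: example() runs the examples from the bottom of the help page
example(median)

# Step 4: apply the function to your own (here, made-up) data
m <- median(c(3, 1, 10))
print(m)
```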

It is highly likely that you will encounter error messages while doing this lab. Here are a few steps that might help you get through them.

  1. Locate which line is causing this error first
  2. Check if you may have a typo in the code. Sometimes another person can spot a typo faster than you.
  3. If the code has no typos, try googling the error message
  4. Scroll through the top few links to see if any of them help
  5. Try working on the next few questions while waiting for answers from the TAs

Warm-up

Recall that a for-loop is an iterator that helps us repeat a task while changing the inputs. Most of your loops will follow the structure below. It can be simplified if you are not storing results.

# what are you iterating over? The vector from -10 to 10
items_to_iterate_over <- c(-10:10)

# pre-allocate the results
out <- rep(0, length(items_to_iterate_over))

# write the iteration statement --
# we'll use indices so we can store the output easily
for (i in seq_along(items_to_iterate_over)) {
  # do something:
  # we capture the median of three random numbers drawn
  # from normal distributions with various means
  out[[i]] <- median(rnorm(n = 3, mean = items_to_iterate_over[[i]]))
}
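
When you only need to see each result, not keep it, the loop can iterate directly over the values and skip both the indices and the pre-allocated vector. A minimal sketch, using a made-up vector of Celsius temperatures:

```r
# Hypothetical input vector, chosen only for illustration
temperatures_c <- c(-5, 0, 12.5, 30)

# Iterate directly over the values; nothing is stored
for (temp in temperatures_c) {
  # Convert to Fahrenheit and show the result
  print(temp * 9 / 5 + 32)
}
```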

Writing for-loops

  1. Write a for-loop that prints the numbers 5, 10, 15, 20, 250000.
  2. Write a for-loop that iterates over the indices of x and prints the ith value of x.
x <- c(5, 10, 15, 20, 250000)
# replace the ... with the relevant code

for (i in ... ){ 
  print(x[[...]])
  }
  3. Write a for-loop that simplifies the following code so that you don’t repeat yourself! Don’t worry about storing the output yet. Use print() so that you can see the output. What happens if you don’t use print()?
sd(rnorm(5))
sd(rnorm(10))
sd(rnorm(15))
sd(rnorm(20))
sd(rnorm(25000))
     a. Adjust your for-loop to see how the sd changes when you use rnorm(n, mean = 4)
     b. Adjust your for-loop to see how the sd changes when you use rnorm(n, sd = 4)
  4. Now store the results of your for-loop above in a vector. Pre-allocate a vector of length 5 to capture the standard deviations.

Vectorization vs. for-loops

Recall, vectorized functions operate on a vector item by item. It’s like looping over the vector! The following for-loop is better written vectorized.

Compare the loop version

names <- c("Marlene", "Jacob", "Buddy")
out <- character(length(names))

for (i in seq_along(names)) {
  out[[i]] <- paste0("Welcome ", names[[i]])
  }

to the vectorized version

names <- c("Marlene", "Jacob", "Buddy") 
out <- paste0("Welcome ", names)

The vectorized code is preferred because it is easier to write and read, and it is often faster because the looping happens inside the function’s compiled code.
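
One way to convince yourself that a loop and its vectorized counterpart do the same work is to run both and compare the results with identical(). The sketch below does this for nchar(); the words vector is invented for illustration.

```r
words <- c("loop", "vector", "iteration")  # hypothetical input

# Loop version: pre-allocate, then fill one element at a time
lengths_loop <- integer(length(words))
for (i in seq_along(words)) {
  lengths_loop[[i]] <- nchar(words[[i]])
}

# Vectorized version: nchar() handles the whole vector in one call
lengths_vec <- nchar(words)

# TRUE when both versions produce exactly the same vector
same <- identical(lengths_loop, lengths_vec)
print(same)
```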

  1. Rewrite your first for-loop, where you printed 5, 10, 15, 20, 250000, as vectorized code.

  2. Rewrite this for-loop as vectorized code:

radii <- c(0:10)

area <- double(length(radii)) 

for (i in seq_along(radii)) { 
  area[[i]] <- pi * radii[[i]] ^ 2 
  }
  3. Rewrite this for-loop as vectorized code:
radii <- c(-1:10)

area <- double(length(radii))

for (i in seq_along(radii)) {
  if (radii[[i]] < 0) {
    area[[i]] <- NaN
  } else {
    area[[i]] <- pi * radii[[i]] ^ 2
  }
}
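
Loops that contain an if/else branch can often be vectorized with base R's ifelse(), which tests a condition element by element and picks a value from one of two alternatives. A small sketch on made-up exam scores (not the radii exercise above):

```r
scores <- c(48, 75, 90, 59)  # hypothetical data

# For each score: "pass" where the condition is TRUE, "fail" otherwise
labels <- ifelse(scores >= 60, "pass", "fail")

print(labels)
```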

Extension

Simulating the Law of Large Numbers

The Law of Large Numbers says that as sample sizes increase, the mean of the sample will approach the true mean of the distribution. We are going to simulate this phenomenon!
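
To get a feel for this claim before writing any loops, one option is to track the running mean of a single large sample. This sketch assumes an arbitrary seed and sample size; cumsum() and seq_along() are base R.

```r
set.seed(1)  # arbitrary seed, only so the draws are reproducible

# 10,000 draws from a normal distribution whose true mean is 0
draws <- rnorm(10000, mean = 0, sd = 5)

# running_mean[k] is the mean of the first k draws
running_mean <- cumsum(draws) / seq_along(draws)

# Compare the estimate at a few increasing sample sizes
print(running_mean[c(10, 100, 1000, 10000)])
```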

We’ll start by making a vector of sample sizes from 1 to 50, to represent increasing sample sizes.

Create a vector called sample_sizes that is made up of the numbers 1 through 50. (Hint: You can use seq() or : notation).
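
For reference, here are a few ways seq() and the : operator build sequences. The endpoints and step sizes are arbitrary examples, not the answer to the exercise.

```r
# The colon operator counts up by 1
one_to_five <- 1:5

# seq() with `by` sets the step size
by_tens <- seq(10, 50, by = 10)

# seq() with `length.out` sets how many evenly spaced values to return
five_points <- seq(0, 1, length.out = 5)

print(one_to_five)
print(by_tens)
print(five_points)
```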

We’ll make an empty tibble to store the results of the for loop:

library(tidyverse)  # provides tibble(), bind_rows(), %>%, and ggplot2
estimates <- tibble(n = integer(), sample_mean = double())

Write a loop over the sample_sizes you specified above. In the loop, for each sample size you will:

  1. Calculate the mean of a random sample of that size from the normal distribution with mean = 0 and sd = 5.
  2. Make an intermediate tibble to store the results
  3. Append the intermediate tibble to your tibble using bind_rows().
set.seed(60637) 
for (___ in ___) {
  # Calculate the mean of a sample from the normal distribution with mean = 0 and sd = 5
  ___ <- ___
  # Make a tibble with your estimates
  this_estimate <- tibble(n = ___, sample_mean = ___) 
  # Append the new rows to your tibble
  ___ <- bind_rows(estimates, ___)
  }

We can use ggplot2 to view the results. Fill in the correct information for the data and x and y variables, so that the n column of the estimates tibble is plotted on the x-axis, while the sample_mean column of the estimates tibble is plotted on the y-axis.

# your data goes in the first position
___ %>%
  ggplot(aes(x = ___, y = ___)) + 
  geom_line()
  1. As the sample size (n) increases, does the sample mean become closer to 0, or farther away from 0?

Rewrite the loop code without looking at your previous code and use a wider range of sample sizes. Try several different sample size combinations. What happens when you increase the sample size to 100? 500? 1000? Use the seq() function to generate a sensibly spaced sequence.

set.seed(60637) 
sample_sizes <- ___ 
estimates_larger_n <- ___

for (___ in ___) {
  ___ <- ___
  ___ <- ___
  ___ <- ___
}

___ %>%
  ggplot(___(___ = ___, ___ = ___)) +
  geom_line()
  1. How does this compare to before?

Extending Our Simulation

Looking at your results, you might think a small sample size is sufficient for estimating a mean, but the population you sampled from had a fairly small standard deviation (sd = 5). Let’s run the same simulation as before with different standard deviations.

Do the following:

  1. Create a vector called population_sd of length 4 with values 1, 5, 10, and 20 (you’re welcome to add larger numbers if you wish).

  2. Make an empty tibble to store the output. Compared to before, this has an extra column for the changing population standard deviations.

  3. Write a nested loop: the outer loop iterates over population_sd and the inner loop iterates over sample_sizes.

  4. Then, make a ggplot graph where the x and y axes are the same, but we facet (aka we create small multiples of individual graphs) on population_sd.
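
If nested loops are new to you, the general pattern is sketched below on an unrelated toy problem. It uses base R's data.frame() and rbind() as stand-ins for tibble() and bind_rows(), and all variable names are hypothetical.

```r
outer_values <- c(1, 2, 3)
inner_values <- c(10, 100)

# Empty table with one column per loop variable plus the result
products <- data.frame(outer = double(), inner = double(), product = double())

for (a in outer_values) {      # outer loop: runs 3 times
  for (b in inner_values) {    # inner loop: runs 2 times per outer value
    # One-row table for this combination of a and b
    this_row <- data.frame(outer = a, inner = b, product = a * b)
    # Append the new row to the results
    products <- rbind(products, this_row)
  }
}

print(products)  # 3 * 2 = 6 rows, one per combination
```

The inner loop finishes all of its iterations before the outer loop moves on to its next value, so the rows are grouped by the outer variable.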

set.seed(60637)
population_sd <- ___
# use whatever you came up with in the previous part
sample_sizes <- ___
estimates_adjust_sd <- ___
for (___ in ___){ 
  for (___ in ___) {
    ___ <- ___
    ___ <- ___
    ___ <- ___
  } 
}

___ %>%
  ggplot(___) +
  geom_line() + 
  facet_wrap(~population_sd) + 
  theme_minimal()

How do these estimates differ as you increase the standard deviation?

Want to improve this tutorial? Report any suggestions/bugs/improvements on here! We’re interested in learning from you how we can make this tutorial better.

Copyright 2024, Jacob Jameson